Talk to your slide deck (Multimodal RAG) using foundation models (FMs) hosted on Amazon Bedrock – Part 2

Amit Arora, Archana Inapudi, Manju Prasad, Antara Raisa

In part 1 of this series, we presented a solution that used the Amazon Titan Multimodal Embeddings model to convert individual slides from a slide deck into embeddings. We stored the embeddings in a vector database and then used the Large Language-and-Vision Assistant (LLaVA 1.5-7b) model to generate text responses to user questions based on the most similar slide retrieved from the vector database. We used AWS services including Amazon Bedrock, Amazon SageMaker, and Amazon OpenSearch Serverless in this solution.

In part 2 of this series, we demonstrate a different approach. We use the Anthropic Claude 3 Sonnet model to generate a text description for each slide in the slide deck. These descriptions are then converted into text embeddings using the Amazon Titan Text Embeddings model and stored in a vector database. We then use the Claude 3 Sonnet model to generate answers to user questions based on the most relevant text description retrieved from the vector database.

You can test both approaches for your dataset and evaluate the results to see which approach works best for your use case. Evaluation of the results is a topic that we will explore in part 3 of this series.

Solution overview

The solution presented in this post answers questions using information contained in the text and visual elements of a slide deck. The design relies on the concept of Retrieval Augmented Generation (RAG). Traditionally, RAG has been associated with textual data that can be processed by large language models (LLMs). In this series, we extend RAG to include images as well. This provides a powerful search capability to extract contextually relevant content from visual elements like tables and graphs along with text.

This solution includes the following components:

  • Amazon Titan Text Embeddings is a text embeddings model that converts natural language text, including single words, phrases, and even large documents, into numerical representations that can be used to power use cases such as search, personalization, and clustering based on semantic similarity.
  • Anthropic Claude 3 Sonnet is a multimodal model that can process both images and text. We use it to generate a text description for each slide during ingestion and to generate answers to user questions during inference.
  • Amazon OpenSearch Service Serverless is an on-demand serverless configuration for Amazon OpenSearch Service. We use OpenSearch Service Serverless as a vector database for storing embeddings generated by the Titan Text Embeddings model. An index created in the OpenSearch Service Serverless collection serves as the vector store for our RAG solution.
  • Amazon OpenSearch Ingestion (OSI) is a fully managed, serverless data collector that delivers data to Amazon OpenSearch Service domains and OpenSearch Serverless collections. In this blog, we use an OSI pipeline API to deliver data to the OpenSearch Serverless vector store.

Solution design

The solution design consists of two parts: ingestion and user interaction. During ingestion, we process the input slide deck by converting each slide into an image, then generating a description and embeddings for each image. We populate the vector data store with the slide embeddings and descriptions. These steps are completed prior to the user interaction steps.

In the user interaction phase, a question from the user is converted into embeddings, and a similarity search is run on the vector database to find a slide that could potentially contain answers to the user's question. We then provide the slide description and the user question to the Claude 3 Sonnet model to generate an answer to the query. All the code for this post is available in the GitHub repo.

Ingestion steps:

Figure 1: Ingestion architecture
  1. Slides are converted to image files (one per slide) in the JPG format. Each image is passed to the Claude 3 Sonnet model to generate a text description, which is then passed to the Titan Text Embeddings model to generate embeddings. In this series, we use the slide deck titled Train and deploy Stable Diffusion using AWS Trainium & AWS Inferentia from the AWS Summit in Toronto, June 2023 to demonstrate the solution.

    • The sample deck has 31 slides, so we generate 31 vector embeddings, each with 1536 dimensions. We also add metadata fields to support rich search queries using OpenSearch’s powerful search capabilities.
  2. The embeddings are sent to the OSI pipeline via an API call; the pipeline in turn ingests the data as documents into the OpenSearch Service Serverless index.

    • Note that the OpenSearch Service Serverless index is configured as the sink for this pipeline and it is created as part of the OpenSearch Service Serverless collection.
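The shape of the document that these ingestion steps produce can be sketched as follows. This helper is illustrative only (it is not part of the notebooks); the field names mirror the index mapping shown later in this post.

```python
from typing import Dict, List

def build_slide_document(embedding: List[float], image_s3_path: str,
                         description: str, slide_number: int,
                         filename: str) -> Dict:
    """Assemble one OpenSearch document for a single slide.

    Field names follow the index mapping used later in this post.
    """
    return {
        "vector_embedding": embedding,      # 1536-dimension Titan Text Embeddings vector
        "image_path": image_s3_path,        # S3 path of the slide JPG
        "slide_text": description,          # Claude 3 Sonnet description of the slide
        "slide_number": str(slide_number),  # stored as text in the mapping
        "metadata": {"filename": filename, "desc": ""},
    }

doc = build_slide_document([0.1] * 1536, "s3://bucket/slide_1.jpg",
                           "Example description", 1, "deck.pdf")
```

One such document is created per slide, which is why the 31-slide sample deck yields 31 documents in the index.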

User interaction steps:

Figure 2: User interaction architecture
  1. A user submits a question related to the slide deck that has been ingested.
  2. The user input is converted into embeddings using the Titan Text Embeddings model, accessed via Amazon Bedrock. An OpenSearch vector search is then performed using these embeddings. We run a k-nearest neighbor search with k=1, which retrieves the single slide most relevant to the user question.
  3. The metadata of the response from OpenSearch Service Serverless contains a path to the image and the description corresponding to the most relevant slide.
  4. A prompt is created by combining the user question and the image description. The prompt is provided to Claude 3 Sonnet hosted on Amazon Bedrock.
  5. The result of this inference is returned to the user.
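The similarity search in step 2 is a standard OpenSearch k-NN query. The following is a minimal sketch of the query body; the `vector_embedding` field name is taken from this post's index mapping, and the helper itself is illustrative rather than part of the notebooks.

```python
from typing import Dict, List

def build_knn_query(query_embedding: List[float], k: int = 1) -> Dict:
    """Build an OpenSearch k-NN search body.

    k=1 returns the single most similar slide; the vector_embedding
    field name follows the index mapping used in this post.
    """
    return {
        "size": k,
        "query": {
            "knn": {
                "vector_embedding": {
                    "vector": query_embedding,
                    "k": k,
                }
            }
        },
    }

query = build_knn_query([0.0] * 1536, k=1)
# A body like this would then be passed to os_client.search(index=..., body=query).
```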

These steps are discussed in detail in the following sections. See the Results section for screenshots and details on the output.

Prerequisites

To implement the solution provided in this post, you should have an AWS account and familiarity with FMs, Amazon Bedrock, SageMaker, and OpenSearch Service.

This solution uses the Claude 3 Sonnet and Titan Text Embeddings models hosted on Amazon Bedrock. Ensure that these models are enabled for use in Amazon Bedrock. On the Amazon Bedrock console, choose Model access. If the models are enabled, the Access status will state “Access granted”, as shown below.

Figure 3: Model access

If the models are not available, enable access by choosing Manage model access, selecting Titan Embeddings G1 - Text and Claude 3 Sonnet, and choosing Request model access. The models are then enabled for use immediately.
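One way to verify access programmatically is to list the foundation models available in your Region and check for the two model IDs. The model IDs below are assumptions based on Bedrock's public naming (verify them in your account), and the boto3 call is commented out so the helper reads on its own.

```python
from typing import Dict, List

# Assumed Bedrock model IDs; verify these against your account and Region.
REQUIRED_MODEL_IDS = [
    "amazon.titan-embed-text-v1",
    "anthropic.claude-3-sonnet-20240229-v1:0",
]

def missing_models(model_summaries: List[Dict], required_ids: List[str]) -> List[str]:
    """Return the required model IDs absent from a list_foundation_models response."""
    available = {m["modelId"] for m in model_summaries}
    return [mid for mid in required_ids if mid not in available]

# Hypothetical usage against the Bedrock control plane:
# import boto3
# bedrock = boto3.client("bedrock")
# summaries = bedrock.list_foundation_models()["modelSummaries"]
# print(missing_models(summaries, REQUIRED_MODEL_IDS))  # [] when both are enabled
```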

Use AWS CloudFormation template to create the solution stack

AWS Region Link
us-east-1
us-west-2

After the stack is created successfully, navigate to the stack’s Outputs tab on the AWS CloudFormation console and note the values for MultimodalCollectionEndpoint and OpenSearchPipelineEndpoint; we will use them in subsequent steps.

Figure 4: CloudFormation stack outputs
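If you prefer to fetch these two values programmatically rather than from the console, the Outputs list returned by CloudFormation's `describe_stacks` API can be flattened into a dict. The stack name below is a placeholder; substitute the name you gave the stack.

```python
from typing import Dict, List

def stack_outputs(outputs: List[Dict]) -> Dict[str, str]:
    """Flatten a CloudFormation Outputs list into an OutputKey -> OutputValue dict."""
    return {o["OutputKey"]: o["OutputValue"] for o in outputs}

# Hypothetical usage (stack name is a placeholder):
# import boto3
# cfn = boto3.client("cloudformation")
# outputs = cfn.describe_stacks(StackName="multimodal-rag")["Stacks"][0]["Outputs"]
# endpoints = stack_outputs(outputs)
# collection_endpoint = endpoints["MultimodalCollectionEndpoint"]
# osi_endpoint = endpoints["OpenSearchPipelineEndpoint"]
```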

The CloudFormation template creates the following resources:

  • IAM roles: the following two IAM roles are created. Update these roles to apply least-privilege permissions as discussed in Security best practices.
    • SMExecutionRole with S3, SageMaker, OpenSearch Service, and Bedrock full access.
    • OSPipelineExecutionRole with access to the S3 bucket and OSI actions.
  • SageMaker Notebook: all code for this post is run via this notebook.
  • OpenSearch Service Serverless collection: vector database for storing and retrieving embeddings.
  • OSI Pipeline: pipeline for ingesting data into OpenSearch Service Serverless.
  • S3 bucket: all data for this post is stored in this bucket.

OSI pipeline setup

The CloudFormation template sets up the pipeline configuration required to create the OSI pipeline with an HTTP source and the OpenSearch Serverless index as its sink. The SageMaker notebook 2_data_ingestion.ipynb shows how to ingest data into the pipeline using the requests HTTP library.

The CloudFormation template also creates the network, encryption, and data access policies required for the OpenSearch Serverless collection. Update these policies to apply least-privilege permissions as discussed in Security best practices.

Note that the CloudFormation template name and OpenSearch Service index name are referenced in the SageMaker notebook 3_rag_inference.ipynb. If you change the default names, make sure you update them in the notebook.

Testing the solution

After the prerequisite steps are complete and the CloudFormation stack has been created successfully, you are ready to run the “talk to your slide deck” implementation:

  1. On the SageMaker console, choose Notebooks in the navigation pane.

  2. Select the MultimodalNotebookInstance and choose Open JupyterLab.

    Figure 5: SageMaker Notebooks
  3. In the File Browser, navigate to the notebooks folder to see the notebooks and supporting files. The notebooks are numbered in their sequence of execution. Instructions and comments in each notebook describe the actions performed by that notebook. We run these notebooks one by one.

  4. Choose 1_data_prep.ipynb to open it in JupyterLab. When the notebook is open, on the Run menu, choose Run All Cells to run the code in this notebook. This notebook downloads a publicly available slide deck, converts each slide into the JPG file format, and uploads the files to the S3 bucket.

  5. Next, choose 2_data_ingestion.ipynb to open it in JupyterLab. When the notebook is open, on the Run menu, choose Run All Cells to run the code in this notebook. We do the following in this notebook:

    • Create an index in the OpenSearch Service Serverless collection. This index stores the embeddings data for the slide deck.
    import json
    import logging

    import boto3
    from opensearchpy import OpenSearch, RequestsHttpConnection, AWSV4SignerAuth

    logger = logging.getLogger(__name__)

    # g is the notebook's config module; host and index_name are set earlier in the notebook
    session = boto3.Session()
    credentials = session.get_credentials()
    auth = AWSV4SignerAuth(credentials, g.AWS_REGION, g.OS_SERVICE)
    
    os_client = OpenSearch(
      hosts = [{'host': host, 'port': 443}],
      http_auth = auth,
      use_ssl = True,
      verify_certs = True,
      connection_class = RequestsHttpConnection,
      pool_maxsize = 20
    )
    
    index_body = """
    {
      "settings": {
        "index.knn": true
      },
      "mappings": {
        "properties": {
          "vector_embedding": {
            "type": "knn_vector",
            "dimension": 1536,
            "method": {
              "name": "hnsw",
              "engine": "nmslib",
              "parameters": {}
            }
          },
          "image_path": {
            "type": "text"
          },
          "slide_text": {
            "type": "text"
          },
          "slide_number": {
            "type": "text"
          },
          "metadata": { 
            "properties" :
              {
                "filename" : {
                  "type" : "text"
                },
                "desc":{
                  "type": "text"
                }
              }
          }
        }
      }
    }
    """
    index_body = json.loads(index_body)
    try:
      response = os_client.indices.create(index_name, body=index_body)
      logger.info(f"response received for the create index -> {response}")
    except Exception as e:
      logger.error(f"error in creating index={index_name}, exception={e}")
    • We use the Claude 3 Sonnet model to generate a text description for each JPG image created in the previous notebook. The following code snippet shows how Claude 3 Sonnet generates the image descriptions.
    import base64
    import json

    def get_img_desc(image_file_path: str, prompt: str) -> str:
        # read the file; the maximum image size supported is 2048 x 2048 pixels
        with open(image_file_path, "rb") as image_file:
            # the raw image bytes must be base64-encoded before being sent to Claude
            input_image_b64 = base64.b64encode(image_file.read()).decode('utf-8')

        body = json.dumps(
            {
                "anthropic_version": "bedrock-2023-05-31",
                "max_tokens": 1000,
                "messages": [
                    {
                        "role": "user",
                        "content": [
                            {
                                "type": "image",
                                "source": {
                                    "type": "base64",
                                    "media_type": "image/jpeg",
                                    "data": input_image_b64
                                },
                            },
                            {"type": "text", "text": prompt},
                        ],
                    }
                ],
            }
        )

        response = bedrock.invoke_model(
            modelId=g.CLAUDE_MODEL_ID,
            body=body
        )

        resp_body = json.loads(response['body'].read().decode("utf-8"))
        resp_text = resp_body['content'][0]['text'].replace('"', "'")

        return resp_text
    • The image descriptions are passed to the Titan Text Embeddings model to generate vector embeddings. The embeddings, along with additional metadata (such as the S3 path and the description of the image file), are stored in the index. The following code snippet shows the call to the Titan Text Embeddings model.
    import json
    from typing import List, Optional

    import botocore

    def get_text_embedding(bedrock: botocore.client.BaseClient, prompt_data: str) -> Optional[List[float]]:
        body = json.dumps({
            "inputText": prompt_data,
        })
        try:
            response = bedrock.invoke_model(
                body=body, modelId=g.FMC_MODEL_ID, accept=g.ACCEPT_ENCODING, contentType=g.CONTENT_ENCODING
            )
            response_body = json.loads(response['body'].read())
            embedding = response_body.get('embedding')
        except Exception as e:
            logger.error(f"exception={e}")
            embedding = None

        return embedding
    • The data is ingested into the OpenSearch Service Serverless index by making an API call to the OpenSearch Ingestion pipeline. The following code snippet shows the call made via the requests HTTP library.
    import json

    import requests
    # AWSSigV4 comes from the requests-auth-aws-sigv4 package
    from requests_auth_aws_sigv4 import AWSSigV4

    data = json.dumps([{
        "image_path": input_image_s3,
        "slide_text": resp_text,
        "slide_number": slide_number,
        "metadata": {
            "filename": obj_name,
            "desc": ""
        },
        "vector_embedding": embedding
    }])

    r = requests.request(
        method='POST',
        url=osi_endpoint,
        data=data,
        auth=AWSSigV4('osis'))
  6. Next, choose 3_rag_inference.ipynb to open it in JupyterLab. When the notebook is open, on the Run menu, choose Run All Cells to run the code in this notebook. This notebook implements the RAG solution: we convert the user question into embeddings, find the most similar image description in the vector database, and then provide the retrieved description to Claude 3 Sonnet to generate an answer to the user question.

    • We use the following prompt template.
      llm_prompt: str = """
    
      Human: Use the summary to provide a concise answer to the question to the best of your abilities without making anything up.
      <question>
      {question}
      </question>
    
      <summary>
      {summary}
      </summary>
    
      Assistant:"""
    • The following code snippet provides the RAG workflow.
    def get_llm_response(bedrock: botocore.client.BaseClient, question: str, summary: str) -> str:
        prompt = llm_prompt.format(question=question, summary=summary)

        body = json.dumps({
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 1000,
            "messages": [
                {
                    "role": "user",
                    "content": [
                        {"type": "text", "text": prompt},
                    ],
                }
            ],
        })

        try:
            response = bedrock.invoke_model(
                modelId=g.CLAUDE_MODEL_ID,
                body=body)

            response_body = json.loads(response['body'].read().decode("utf-8"))
            llm_response = response_body['content'][0]['text'].replace('"', "'")

        except Exception as e:
            logger.error(f"exception while slide_text={summary[:10]}, exception={e}")
            llm_response = None

        return llm_response
    
    import os
    from typing import Dict

    from IPython.display import Image, display

    # create prompt and convert to embeddings
    question: str = "How does Inf2 compare in performance to comparable EC2 instances? I need numbers."
    text_embedding = get_text_embedding(bedrock, question)

    # vector db search
    vector_db_response: Dict = find_similar_data(text_embedding)

    # download image for local notebook display
    s3_img_path = vector_db_response.get('hits', {}).get('hits')[0].get('_source').get('image_path')
    logger.info(f"going to answer the question=\"{question}\" using the image \"{s3_img_path}\"")

    !aws s3 cp {s3_img_path} .
    local_img_path = os.path.basename(s3_img_path)
    display(Image(filename=local_img_path))

    # Ask Claude 3 Sonnet
    slide_text = vector_db_response.get('hits', {}).get('hits')[0].get('_source').get('slide_text')

    llm_response = get_llm_response(bedrock, question, slide_text)
    print(llm_response)

Results

Here is a compilation of some user questions and responses generated by our implementation. The Question column captures the user question, and the Answer column is the textual response generated by Claude 3 Sonnet. The Image is the k-nearest neighbor (k=1) slide match returned by the OpenSearch Service Serverless vector search. This slide is interpreted by Claude 3 Sonnet to provide the answer.

Multimodal RAG results
Question Answer Image
How does Inf2 compare in performance to comparable EC2 instances? I need numbers. According to the slide deck, Inf2 instances powered by AWS Inferentia2 offer up to 4x higher throughput and 10x lower latency compared to comparable EC2 instances.
As per the AI/ML flywheel, what do the AWS AI/ML services provide? The AWS AI/ML services provide better $/performance capabilities, new capabilities, and investment in innovation.
Compared to GPT-2, how many more parameters does GPT-3 have? What is the numerical difference between the parameter size of GPT-2 and GPT-3? According to the slide, GPT-3 has 175 billion parameters, while GPT-2 has 1.5 billion parameters. The numerical difference between the parameter size of GPT-2 and GPT-3 is 173.5 billion.
What are quarks in particle physics? I did not find the answer to this question in the slide deck.

Tip

Note that you can use OpenSearch Dashboards to interact with the OpenSearch API to run quick tests on your index and ingested data.
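For example, a quick sanity check is to retrieve a single document and confirm that the expected fields were ingested. The query below is plain OpenSearch DSL expressed as a Python dict; the `vector_embedding` field name follows this post's index mapping, and excluding it keeps the response readable.

```python
import json

def build_sanity_check_query(size: int = 1) -> dict:
    """Match-all query fetching a few documents, excluding the bulky
    embedding vector so the response is easy to read in Dev Tools."""
    return {
        "size": size,
        "query": {"match_all": {}},
        "_source": {"excludes": ["vector_embedding"]},
    }

# Paste the printed JSON into OpenSearch Dashboards Dev Tools as the body of:
#   GET /<index-name>/_search
print(json.dumps(build_sanity_check_query(), indent=2))
```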

Figure 6: OpenSearch dashboard GET example

Cleanup

To avoid incurring future charges, delete the resources. You can do this by deleting the stack from the CloudFormation console.

Figure 7: Delete CloudFormation Stack

Conclusion

Enterprises generate new content all the time, and slide decks are a common mechanism used to share and disseminate information internally within the organization and externally with customers or at conferences. Over time, rich information can remain buried and hidden in non-text modalities like graphs and tables in these slide decks. You can use this solution and the power of multimodal FMs such as the Titan Text Embeddings and Claude 3 Sonnet models to discover new information or uncover new perspectives on content in slide decks.

This is part 2 of a 3-part series. We used Amazon Titan Multimodal Embeddings and LLaVA models in part 1. Look out for part 3 where we will compare the approaches from part 1 and part 2.

Portions of this code are released under the Apache 2.0 License as referenced here: https://aws.amazon.com/apache-2-0/


Author bio

Amit Arora is an AI and ML Specialist Architect at Amazon Web Services, helping enterprise customers use cloud-based machine learning services to rapidly scale their innovations. He is also an adjunct lecturer in the MS data science and analytics program at Georgetown University in Washington D.C.



Manju Prasad is a Senior Solutions Architect within Strategic Accounts at Amazon Web Services. She focuses on providing technical guidance in a variety of domains, including AI/ML, to a marquee M&E customer. Prior to joining AWS, she worked for companies in the financial services sector and at a startup.



Archana Inapudi is a Senior Solutions Architect at AWS supporting Strategic Customers. She has over a decade of experience helping customers design and build data analytics, and database solutions. She is passionate about using technology to provide value to customers and achieve business outcomes.



Antara Raisa is an AI and ML Solutions Architect at Amazon Web Services supporting Strategic Customers based out of Dallas, Texas. She also has previous experience working with large enterprise partners at AWS, where she worked as a Partner Success Solutions Architect for digital native customers.